Abstract

Many real-world applications involve structured data, in which not only the inputs
but also the outputs interact. However, typical classification and regression
models often lack the ability to simultaneously explore high-order interactions
within the input and within the output. In this paper, we present a deep
learning model that aims to generate a powerful nonlinear functional mapping from
structured input to structured output. More specifically, we propose to integrate
high-order hidden units, guided discriminative pretraining, and high-order
auto-encoders for this purpose. We evaluate the model on three datasets and obtain
state-of-the-art performance among competitive methods. Our current work
focuses on structured output regression, which is a less explored area, although the
model can be extended to handle structured label classification.


1 Introduction 

Problems of predicting structured output span a wide range of fields, including natural
language understanding, speech processing, bioinformatics, image processing, and computer vision,
amongst others. Structured learning or prediction has been approached with many different
models, such as graphical models, large margin-based approaches, and conditional
restricted Boltzmann machines. Compared with structured label classification, structured
output regression is a less explored topic in both the machine learning and data mining communities.
Aiming at regression tasks, methods such as continuous conditional random fields [13] have also
been successfully developed. Nevertheless, a property shared by most of these previous methods
is that they make certain structures in the output space explicit and exploit them, which is quite
limiting.

The past decade has seen great advances of deep neural networks in modeling high-order,
nonlinear interactions. Our work aims to extend this success to construct a nonlinear functional
mapping from high-order structured input to high-order structured output. To this end, we propose a
deep High-order Neural Network with Structured Output (HNNSO). The upper layer of the network
implicitly focuses on modeling interactions among outputs, with a high-order auto-encoder that aims
to recover correlations in the predicted multiple outputs; the lower layer captures
high-order input structures, using bilinear tensor products; and the middle layer constructs
a mapping from input to output. In particular, we introduce a discriminative pretraining approach to
guide the focus of these different layers.

To the best of our knowledge, our model is the first attempt to construct deep learning schemes for
structured output regression with high-order interactions. We evaluate and analyze the proposed
model on multiple datasets: one from natural language understanding and two from image
processing. We show that the proposed strategy achieves state-of-the-art predictive performance in
comparison to other competitive methods.

2 High-Order Neural Models with Structured Output 

We regard a nonlinear mapping from structured input to structured output as consisting of three
integral and complementary components in a high-order neural network, which we name the
High-order Neural Network with Structured Output (HNNSO). Specifically, given a $D \times N$ input
matrix $[X_1, \ldots, X_D]^T$ and a $D \times M$ output matrix $[Y_1, \ldots, Y_D]^T$, we aim to
model the underlying mapping $f$ between the inputs $X_d \in \mathbb{R}^N$ and the outputs
$Y_d \in \mathbb{R}^M$. Figure 1 presents a specific implementation of HNNSO. Note that other
variants are allowed; for example, the dotted rectangle may implement multiple layers.

Figure 1: A specific implementation of a high-order neural network with structured output.

The top layer network is a high-order de-noising auto-encoder (the green portion of Figure 1).
In general, an auto-encoder of this kind is used to denoise input data. In our model, we use it
to denoise the predicted output resulting from the lower layers, so as to capture the interplay
among outputs. Similar to the strategy employed by Memisevic [10], during training we randomly
corrupt a portion of the gold labels and feed the perturbed data to the auto-encoder. The hidden
unit activations of the auto-encoder are first calculated by combining two versions of such corrupted
gold labels, using a tensor $T^e$ to capture their multiplicative interactions. Subsequently, the hidden
layer is used to gate the top tensor $T^d$ to recover the true labels from the perturbed gold labels. As
a result, the corrupted data force the encoder to reconstruct the true labels, so that the tensors and
the hidden layer encode the covariance patterns among the outputs during reconstruction.

The bottom layer (the red portion of Figure 1) is a bilinear tensor-based network that
multiplicatively relates input vectors: a third-order tensor accumulates evidence from a set of quadratic
functions of the input vectors. In our implementation, as in the prior tensor-network work we follow,
each input vector is a concatenation of two vectors. Unlike that work, we concatenate two dependent
vectors: the input unit $X$ ($X \in \mathbb{R}^N$) and its nonlinear, first-order projected vector $h(X)$.
Hence, the model explores the high-order multiplicative interplay not just among the elements of $X$
but also with the nonlinearly projected vector $h(X)$.
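To make the bottom layer concrete, the following NumPy sketch computes one forward pass of a bilinear tensor layer over the concatenation of an input and its first-order projection. It is an illustration under stated assumptions rather than the authors' implementation: the function name, the parameter names (`W_h`, `b_h`, `T`, `W_o`, `b_o`), the layer sizes, and the random initialization are all hypothetical.

```python
import numpy as np

def bilinear_tensor_layer(x, W_h, b_h, T, W_o, b_o):
    """Forward pass of one high-order layer for a single input vector x.

    x   : (n,)            input vector
    W_h : (k, n)          weights of the first-order projection h(x)
    b_h : (k,)            bias of the first-order projection
    T   : (m, n+k, n+k)   third-order tensor, one slice per output unit
    W_o : (m, n+k)        weights of the ordinary linear term
    b_o : (m,)            bias of the ordinary linear term
    """
    h = np.tanh(W_h @ x + b_h)      # nonlinear first-order projection h(x)
    z = np.concatenate([x, h])      # concatenated vector [x; h(x)]
    # Quadratic form z^T T[j] z, evaluated for every output unit j.
    bilinear = np.einsum("i,jik,k->j", z, T, z)
    return np.tanh(bilinear + W_o @ z + b_o)

# Hypothetical sizes: 15 inputs, 8 hidden units, 10 outputs.
rng = np.random.default_rng(0)
n, k, m = 15, 8, 10
y0 = bilinear_tensor_layer(
    rng.normal(size=n),
    0.1 * rng.normal(size=(k, n)), np.zeros(k),
    0.01 * rng.normal(size=(m, n + k, n + k)),
    0.1 * rng.normal(size=(m, n + k)), np.zeros(m),
)
print(y0.shape)  # (10,)
```

Each slice `T[j]` contributes one quadratic function of `[x; h(x)]`, so products between raw inputs, between projected units, and across the two groups all enter the same output unit.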

We also leverage discriminative pretraining to help construct our functional mapping from structured
input to structured output. Pretraining guides HNNSO to model the interdependency among outputs,
among inputs, and between input and output, with different layers of the network focusing
on different types of structure. Specifically, we pretrain the network layer by layer in a bottom-up
fashion, using the gold output labels. The input to the second layer and above is the output of the
layer right below it, except for the top layer, where the corrupted gold output labels are used as input.
In this way, the bottom layer is able to focus on capturing the input structures, and the top layer can
concentrate on encoding complex interaction patterns among outputs. Importantly, the pretraining
also ensures that, when the whole network is fine-tuned (discussed later), the inputs to the
auto-encoder have distributions and structured patterns closer to those of the true labels (as will be
seen in the experimental section). Consequently, pretraining helps the auto-encoder receive inputs
with similar structures in both learning and prediction. Finally, we perform fine-tuning to
simultaneously optimize all the parameters of the three layers. Unlike in pretraining, we use the
uncorrupted outputs resulting from the second layer as the input to the auto-encoder.
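The difference between the two training phases lies in what the top-layer auto-encoder receives as input. The sketch below illustrates that switch only. The text does not specify the corruption type, so masking entries to zero is an assumption here, as are the function names, the `frac` parameter, and the phase labels.

```python
import numpy as np

def corrupt(labels, frac, rng):
    """Return a copy of `labels` with roughly a fraction `frac` of entries zeroed.

    Masking noise is an assumption; the corruption type is not specified.
    """
    keep = rng.random(labels.shape) >= frac   # True marks entries that survive
    return labels * keep

def autoencoder_inputs(phase, gold, layer2_out, frac, rng):
    """Choose the two input copies fed to the top-layer auto-encoder.

    pretrain : two independently corrupted copies of the gold labels
    finetune : the uncorrupted layer-two outputs, used for both copies
    In both phases the reconstruction target remains the gold labels.
    """
    if phase == "pretrain":
        return corrupt(gold, frac, rng), corrupt(gold, frac, rng)
    if phase == "finetune":
        return layer2_out, layer2_out
    raise ValueError(f"unknown phase: {phase}")

rng = np.random.default_rng(0)
gold = rng.normal(size=(4, 10))               # hypothetical gold labels
copy_a, copy_b = autoencoder_inputs("pretrain", gold, None, 0.3, rng)
print(copy_a.shape, copy_b.shape)
```

Because the bottom layers are pretrained against the gold labels first, the layer-two outputs seen during fine-tuning should resemble the label-like inputs the auto-encoder saw during pretraining.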

Model Formulation and Learning. As illustrated in the red portion of Figure 1, HNNSO first
calculates quadratic interactions among the input and its nonlinear transformation. In detail, it first
computes the hidden vector $h^x$ from the provided input $X$. For simplicity, we apply a standard linear
neural network layer (with weight $W^x$ and bias term $b^x$) followed by the tanh transformation:
$h^x = \tanh(W^x X + b^x)$, where $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$. Next, the first layer
outputs are calculated as:

$$Y^{(0)} = \tanh\left( \begin{bmatrix} X \\ h^x \end{bmatrix}^T T^x \begin{bmatrix} X \\ h^x \end{bmatrix} + W^{(0)} \begin{bmatrix} X \\ h^x \end{bmatrix} + b^{(0)} \right) \qquad (1)$$

The term $\left( W^{(0)} [X; h^x] + b^{(0)} \right)$ here is similar to the standard linear neural network
layer. The additional term is a bilinear tensor product with a third-order tensor $T^x$. The tensor relates
two vectors, each concatenating the input unit $X$ with the learned hidden vector $h^x$. The computation
for the second hidden layer $Y^{(1)}$ is similar to that of the first hidden layer $Y^{(0)}$. When learning
the de-noising auto-encoder layer (the green portion of Figure 1), the encoder takes two copies of the
input, namely $Y^{(1)}$, and feeds their pairwise products into the hidden tensor, i.e., the encoding
tensor $T^e$:

$$h^e = \tanh\left( [Y^{(1)}]^T \, T^e \, [Y^{(1)}] \right) \qquad (2)$$

Next, a hidden decoding tensor $T^d$ is used to multiplicatively combine $h^e$ with the input vector
$Y^{(1)}$ to reconstruct the final output $Y^{(2)}$. Through minimizing the reconstruction error, the hidden
tensors are forced to learn the covariance patterns within the final output $Y^{(2)}$.
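The encoding and decoding steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the exact contraction used for decoding, the tanh on the reconstruction, the sharing of a single tensor `T` between encoder and decoder, and all names and sizes are assumptions.

```python
import numpy as np

def encode(y_a, y_b, T):
    """Hidden units computed from pairwise products of two input copies.

    y_a, y_b : (m,)       two copies of the auto-encoder input
    T        : (k, m, m)  tensor; slice T[j] weights the products y_a[i] * y_b[l]
    """
    return np.tanh(np.einsum("i,jil,l->j", y_a, T, y_b))

def decode(h, y_a, T):
    """Reconstruct the output by letting the hidden units gate the same tensor."""
    # out[l] = sum over j, i of h[j] * y_a[i] * T[j, i, l]
    return np.tanh(np.einsum("j,i,jil->l", h, y_a, T))

def reconstruction_error(y_a, y_b, target, T):
    """Sum-squared error between the reconstruction and the target vector."""
    recon = decode(encode(y_a, y_b, T), y_a, T)
    return float(np.sum((recon - target) ** 2))

# Hypothetical sizes: 10 outputs, 20 hidden units.
rng = np.random.default_rng(0)
T = 0.1 * rng.normal(size=(20, 10, 10))
y = rng.normal(size=10)
print(encode(y, y, T).shape, decode(encode(y, y, T), y, T).shape)
```

Since the hidden units are driven by products `y_a[i] * y_b[l]`, minimizing the reconstruction error pushes the tensor to represent which output dimensions co-vary.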

In our study, we use an auto-encoder with tied parameters for convenience; that is, the same tensor
is used for $T^e$ and $T^d$. Also, de-noising is applied to prevent an overcomplete hidden layer from
learning the trivial identity mapping between the input and output. In the de-noising process, the two
copies of the input are corrupted independently. In our implementation, all model parameters can be
learned by gradient-based optimization. We minimize, over all input instances $(X_i, Y_i)$, the
sum-squared loss (note: cross-entropy will be used for classification tasks) between the output vector
on the top layer and the true label vector:

$$\ell(\theta) = \sum_{i=1}^{N} E_i(X_i, Y_i; \theta) + \lambda \|\theta\|_2^2 \qquad (4)$$
Also, we employ standard $L_2$ regularization for all the parameters, weighted by $\lambda$. For our
non-convex objective function, we use AdaGrad [3] to search for the optimal model parameters.
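As an illustration of the optimization step, the sketch below applies AdaGrad to a sum-squared loss with an $L_2$ penalty. To stay short and self-contained it uses a toy linear model in place of HNNSO, which is an assumption of this example; the function names, learning rate, step count, and synthetic data are likewise hypothetical.

```python
import numpy as np

def objective_and_grad(theta, X, Y, lam):
    """Sum-squared error of the toy linear model X @ theta plus an L2 penalty."""
    residual = X @ theta - Y
    loss = np.sum(residual ** 2) + lam * np.sum(theta ** 2)
    grad = 2.0 * X.T @ residual + 2.0 * lam * theta
    return loss, grad

def adagrad(theta, X, Y, lam=1e-3, lr=0.5, steps=200, eps=1e-8):
    """Minimize the objective with per-parameter AdaGrad step sizes."""
    g_sq = np.zeros_like(theta)             # running sum of squared gradients
    for _ in range(steps):
        _, grad = objective_and_grad(theta, X, Y, lam)
        g_sq += grad ** 2
        # Parameters with large accumulated gradients take smaller steps.
        theta = theta - lr * grad / (np.sqrt(g_sq) + eps)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.0, -2.0, 0.5])          # synthetic targets
theta = adagrad(np.zeros(3), X, Y)
print(objective_and_grad(theta, X, Y, 1e-3)[0])
```

For the full network, the same update would be applied to every weight matrix, bias, and tensor, with gradients obtained by backpropagation.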


3 Experiments 

Baselines 

We compared HNNSO's predictive performance, in terms of Root Mean Square Error (RMSE), with
six regression models: (1) Multi-Objective Decision Trees (MODTs); (2) a collection of
Support Vector Regression models (denoted SVM-Reg) with an RBF kernel, one for each target
attribute; (3) a traditional neural network, i.e., the Multi-Layer Perceptron (MLP) with one hidden
layer and multiple output nodes; (4) the so-called multivariate multiple regression (denoted
MultivariateReg), which takes into account the correlations among the multiple targets using a matrix
computation; (5) an approach that stacks MultivariateReg on top of the MLP (denoted
MLP-MultivariateReg); and (6) Gaussian Conditional Random Fields (Gaussian-CRF) [13, 14], in
which the outputs from an MLP were used as the CRF's node features, and the square of the distance
between two target variables was modeled by an edge feature. In our experiments, all the parameters
of these baselines were carefully tuned.
Methods               SSTB                  MNIST                 USPS
                      RMSE     Rel. error   RMSE     Rel. error   RMSE     Rel. error
                               reduction             reduction             reduction
MODTs                 0.0567   34.2%        0.0739   33.1%        0.6487   13.8%
SVM-Reg               0.0452   17.4%        0.0602   17.9%        0.5977    6.4%
MLP                   0.0721   48.2%        0.0800   38.2%        0.6683   16.3%
MultivariateReg       0.0614   39.2%        0.1097   54.9%        0.6169    9.3%
MLP-MultivariateReg   0.0705   47.0%        0.0791   37.5%        0.6059    7.7%
Gaussian-CRF          0.0706   47.1%        0.0800   38.2%        0.6047    7.5%
HNNSO                 0.0373   -            0.0494   -            0.5591   -

Table 1: Ten-fold averaged RMSE scores of models on the SSTB, MNIST, and USPS data. The
differences between HNNSO and the other models are statistically significant at the 95% confidence level.

Figure 2: Effect of pretraining. The distribution of the predicted layer-two output with pretraining
(middle) is closer to that of the true labels (right) than that of the version without pretraining (left).
Panels, left to right: layer two output (no pretrain), layer two output (pretrain), true labels.


Datasets 

There has recently been a surge of interest in using real-valued, low-dimensional vectors to
represent a word or a sentence in natural language processing (NLP). Our first experiment was
set up in such a setting. Specifically, we used the Stanford Sentiment Tree Bank (SSTB)
dataset, which contains 11,855 movie review sentences. In the best reported embeddings,
each sentence is represented by a 25-dimensional vector. We obtained these vectors from
http://nlp.stanford.edu/sentiment/ and used the first 15 elements to predict the last 10 dimensions.
Our second experiment used 10,000 examples from the test set of the MNIST digit database¹. We
deliberately employed PCA to reduce the dimension of the data to 30, resulting in 30 PCA components
that are pairwise linearly uncorrelated. In our experiment, we used the first 15
dimensions to predict the last 15 dimensions. Our last experiment used the USPS handwritten digit
database². We randomly sampled 1100 images from the original data set, and used the first half of
the image (128 pixels) to predict the second half (128 pixels) of the image.
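The MNIST preparation step (PCA to 30 dimensions, then the first 15 components as input and the last 15 as targets) could be reproduced along the following lines. This is a sketch under assumptions: it uses random data as a stand-in for the MNIST pixels, computes PCA via an SVD, and the function names are hypothetical.

```python
import numpy as np

def pca_project(data, n_components):
    """Project the rows of `data` onto the top principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def split_inputs_targets(features, n_inputs):
    """Use the first `n_inputs` columns as input and the remaining ones as targets."""
    return features[:, :n_inputs], features[:, n_inputs:]

rng = np.random.default_rng(0)
pixels = rng.normal(size=(200, 64))          # stand-in for flattened images
components = pca_project(pixels, 30)
inputs, targets = split_inputs_targets(components, 15)
print(inputs.shape, targets.shape)  # (200, 15) (200, 15)
```

Because PCA components are linearly uncorrelated by construction, any predictive signal from the first 15 components to the last 15 must come from nonlinear dependencies, which is presumably why the authors describe this setup as deliberate.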
Figure 3: Effect of the auto-encoder: transforming input (gray) to output (light blue).

Figure 4: Errors made by SVM-Reg (green) and HNNSO (red) for each target.


¹ http://yann.lecun.com/exdb/mnist/
² http://www.cs.nyu.edu/~roweis/data/usps_all.mat
Figure 5: Predicting the right half of a digit using the left half in the USPS data.


General Performance 

Table 1 presents the performance of the different regression models on the SSTB, MNIST, and USPS
datasets. The results show that HNNSO achieves significantly lower RMSE scores than
the other models. On all three datasets, the relative error reduction achieved by HNNSO over the other
methods was at least 6.4% (ranging between 6.4% and 54.9%).
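The relative error reduction figures in Table 1 are consistent with the formula (baseline RMSE minus HNNSO RMSE) divided by baseline RMSE. The short sketch below shows that arithmetic together with a plain RMSE helper; the function names are hypothetical, and pooling all values into one RMSE is an assumed convention that the text does not spell out.

```python
import math

def rmse(predictions, targets):
    """Root mean square error over paired scalar predictions and targets."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

def relative_error_reduction(baseline_rmse, model_rmse):
    """Fraction of the baseline's RMSE that the model removes."""
    return (baseline_rmse - model_rmse) / baseline_rmse

# SSTB RMSE values taken from Table 1: MODTs versus HNNSO.
print(f"{relative_error_reduction(0.0567, 0.0373):.1%}")  # prints 34.2%
```

The same function applied to the USPS column with SVM-Reg as the baseline (0.5977 versus 0.5591) gives the 6.4% lower bound quoted above.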

Detailed Analysis

We use the SSTB dataset to gain some insight into HNNSO's modeling behavior. As shown above,
the HNNSO model achieved an RMSE score of 0.0373 on the
SSTB data. Without pretraining, the error increases by 9.4% in relative terms. Figure 2 further depicts
the distribution of the first output variable of the data. The figure indicates that the distribution
of the auto-encoder input with pretraining (middle) is closer to the distribution of the true labels
(right) than that without pretraining (left). Such structured patterns are important for the encoder, as
discussed earlier.

In Figure 3, we also show the input (gray boxes) and output (light-blue boxes) of the auto-encoder in
HNNSO, as well as the true labels (dark-blue boxes), on the SSTB data. Each box in each color group
represents one of the ten output variables, in the same order. Figure 3 shows that the patterns of the
light-blue boxes are similar to those of the dark-blue boxes. This suggests that the encoder is able to
guide the output predictions to follow structured patterns similar to those of the true labels.

In Figure 4, we further depict the errors made by HNNSO and SVM-Reg (the second best
approach). Each box in each color group represents the error, calculated as the predicted value minus
the true value, on each of the ten output variables, in the same order. Figure 4 suggests that
the errors made by HNNSO have narrow and consistent variances across the ten
output targets. In contrast, the variances of the errors obtained by
SVM-Reg differ much more across the ten output targets, suggesting that SVM-Reg makes good
predictions on some output targets without considering the interactions with other targets.

Visualization 

Figure 5 plots three digits from the USPS data, including the true images (right) and the predictions
made by HNNSO (left) and MLP (middle). The figure shows that HNNSO was able to recover the
images well. In contrast, MLP yielded some missing pixels on the right halves of the images.


4 Conclusion 


We propose a deep high-order neural network to construct nonlinear functional mappings from
structured input to structured output for regression. We aim to achieve this goal jointly with
complementary components that focus on capturing different types of interdependency. Experimental
results on three benchmark datasets show the advantage of our model over several competing approaches.
In the future, we plan to explore our strategy with a hinge loss for structured label classification, with
applications in image labeling and scene understanding.